Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for abfss://[email protected]/path URL syntax #72

Merged
merged 3 commits into from
Aug 30, 2024

Conversation

gdubya
Copy link

@gdubya gdubya commented Aug 18, 2024

While playing with the uc_catalog and delta extensions I encountered an error while trying to read from an ADLS gen2 account using the abfss syntax in this format: abfss://[email protected]/path.

This PR corrects the URL parser to account for this additional variation.

Linked issues:

  1. Azure ADLS Gen2 with HNS: Unable to delta_scan FQDN abfss URL duckdb-delta#71
  2. Azure support? uc_catalog#7
  3. IO Error: AzureBlobStorageFileSystem open file error reading Iceberg written by Spark and Nessie catalog #71

@sugibuchi
Copy link

Could you please consider supporting abfs:// in addition to abfss:// as well?

abfs:// is commonly used with the Hadoop ABFS support (the original implementation of ABFS). There is no difference between abfs:// and abfss:// as the Hadoop ABFS support enforces TLS by default.

@gdubya
Copy link
Author

gdubya commented Aug 20, 2024

@sugibuchi I'm happy to try adding that, but I would like to do it as a separate PR to limit the scope of what's affected here. The abfss string is used quite a few places so it will be a bit of work to add support for abfs as well, as simple as it sounds.

@gdubya
Copy link
Author

gdubya commented Aug 20, 2024

@samansmink Could you have a look at this PR and merge if all looks ok? Apologies if it's not you that I should ping. I just looked at the other open PR. Maybe there should be a CODEOWNERS file in the repo?

@quentingodeau
Copy link
Contributor

Hello,
I had in the past allow the syntax that you are proposing here but remove it because how scope for secret work. The scope when defined on secret return the score with the path that the score is simply the common distance between path and scope.
So if you want to define a secret for a all storage account with the container@sa you cannot anymore.

From my point a view I do not think that a good idea but maybe I a particular user that use too many storage account.

A solution for this can also be to change the way secrets score is compute and allow wildcard.

Regards
Quentin

@gdubya
Copy link
Author

gdubya commented Aug 21, 2024

@quentingodeau I see what you mean. But given how important support for this syntax is, would it be ok to accept this change and document the limitations with regards to the scopes? This change should not affect existing functionality, only add some more functionality (which is slightly limited with regards to the scopes, but better than nothing, IMHO).

@quentingodeau
Copy link
Contributor

Yes all abfss path are in fact mostly manage this way. I was just to point out the limitations of the syntax. I really think that we should adapt the way score are compute in the secret management.
I have not enough time at the moment but if it is not done before I will try to take a look.
Also at the end I think we should only keep one syntax (but that only my opinion)
Regards

@sugibuchi
Copy link

@gdubya

I'm happy to try adding that, but I would like to do it as a separate PR

OK. I understand and agree to this.

@sugibuchi
Copy link

Also at the end I think we should only keep one syntax (but that only my opinion)

If we should keep only one syntax, please consider keeping the Hadoop compatible syntax.

For historical reasons, each ABFS filesystem implementation supports a different set of ABFS URL syntax.

Apache Hadoop's hadoop-azure package link

  • adfs[s]://<container>@<account>.dsf.core.windows.net/path/to/file

Python adlfs for fsspec link

  • Hadoop-compatible URL syntax , and
  • az://<container>/path/to/file
  • adfs[s]://<container>/path/to/file

Rust object_store::azure link

  • Hadoop-compatible URL syntax, and
  • adlfs-compatible URL syntax , and
  • azure://<container>/path/to/file
  • https://<account>.blob.core.windows.net/<container>/path/to/file
  • https://<account>.dfs.core.windows.net/<container>/path/to/file

Apache Arrow link

  • Hadoop-compatible URL syntax , and
  • abfs[s]://[:<password>@]<account>.blob.core.windows.net/<container>/path/to/file
  • abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net/path/to/file
  • abfs[s]://[<account[:<password>]@]<host[.domain]>[<:port>]/<container>/path/to/file
  • abfs[s]://[<account[:<password>]@]<container>[/path]

DuckDB azure extension link

  • az://<container>/path/to/file
  • az://<account>.blob.core.windows.net/<container>/path/to/file
  • adfss://<container>/path/to/file
  • adfss://<account>.dsf.core.windows.net/<container>/path/to/file

This inconsistency of supported ABFS URL syntax is a source of headaches, particularly when we combine different implementations of the ABFS file system (for example, extracting data from a massive table using Spark, then interactively analysing it with Polars, etc.).

Fortunately, the ABFS URL syntax defined by Hadoop's hadoop-azure (the original ABFS implementation) can be a safe choice when we combine different ABFS implementations as this syntax is widely supported by almost all ABFS implementations, except the DuckDB Azure extension.

To maximize interoperability with other frameworks/libraries, the support of the Hadoop-compatible syntax should be a high priority.

@gdubya
Copy link
Author

gdubya commented Aug 22, 2024

@sugibuchi Thank you for the summary of the various options. It is quite a mess! I agree with the conclusion that the ABFS syntax should be the main priority.

Hopefully we can merge this PR now, and then any decisions made about futher changes can be handled in a different one?

Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I will fix CI on main then merge that into this PR so that CI can run properly here

@samansmink
Copy link
Collaborator

all green 🟢

@samansmink samansmink merged commit eddc484 into duckdb:main Aug 30, 2024
18 checks passed
@samansmink
Copy link
Collaborator

@sugibuchi your suggestion is now implemented

thanks @gdubya!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants